Exploratory Data Analysis

Motivation and Objectives

Exploratory Data Analysis (EDA) serves as the critical foundation for understanding the complex clinical landscape of cancer patient data before building predictive models. Our analysis focuses on two major cancer types affecting women: Breast Invasive Carcinoma (BRCA) and Cervical Squamous Cell Carcinoma (CESC) from The Cancer Genome Atlas (TCGA).

Why EDA is Essential for Cancer Data

Cancer datasets present unique analytical challenges that make thorough exploration crucial:

  1. Multi-dimensional Clinical Complexity: Cancer progression involves intricate relationships between patient demographics, tumor characteristics, staging systems (AJCC, FIGO), treatment modalities, and survival outcomes that require systematic investigation.

  2. Data Quality Assessment: EDA also let us verify how effective our data cleaning stage had been.

  3. Feature Engineering Insights: Understanding distributions and relationships helped identify opportunities for creating meaningful derived features (e.g., ordinal encoding of cancer stages, treatments offered).

  4. Model Selection Guidance: EDA revealed whether relationships are linear or non-linear, helping inform appropriate algorithm choices for subsequent supervised and unsupervised learning tasks.
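Point 3 above mentions ordinal encoding of cancer stages. A minimal sketch of one way to do this with pandas is shown below. The stage order, the decision to treat "stage x" as missing, the `stage_ordinal` column name, and the toy rows are all illustrative assumptions, not the mapping used in this project.

```python
import pandas as pd

# Toy stand-in for the real data; only the column name comes from the dataset.
df = pd.DataFrame({
    "diagnoses.ajcc_pathologic_stage": [
        "stage i", "stage iib", "stage iv", "stage x", "stage iiia", None,
    ]
})

# Assumed severity order (least to most severe). "stage x" (not assessable)
# is deliberately left out so that it becomes missing rather than a rank.
STAGE_ORDER = [
    "stage i", "stage ia", "stage ib",
    "stage ii", "stage iia", "stage iib",
    "stage iii", "stage iiia", "stage iiib", "stage iiic",
    "stage iv",
]

stage = pd.Categorical(
    df["diagnoses.ajcc_pathologic_stage"], categories=STAGE_ORDER, ordered=True
)
# .codes is -1 for anything outside the category list (including missing values).
codes = pd.Series(stage.codes, index=df.index, dtype="float64")
df["stage_ordinal"] = codes.where(codes >= 0)

print(df)
```

Leaving unassessable stages as missing, rather than assigning them a rank, avoids implying a severity that the staging system does not define.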

Research Questions Driving Our Analysis

Our EDA is designed to answer key questions that will inform our modeling strategy:

  • Survival Patterns: How do survival times vary between BRCA and CESC patients? What are the distributional characteristics of our regression target?
  • Staging Relationships: How do different staging systems (AJCC pathologic staging, FIGO staging for cervical cancer) relate to patient outcomes?
  • Treatment Impact: What treatment patterns exist, and how do they correlate with survival outcomes?
  • Feature Relationships: Which clinical variables show the strongest associations with survival time and each other?
  • Data Completeness: Where are the gaps in our data, and how might they impact model performance?
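The data completeness question can be approached with a per-column missingness table. The sketch below is a hedged illustration on toy rows; the list of placeholder strings treated as missing is an assumption and should be checked against the actual values in each column.

```python
import numpy as np
import pandas as pd

# Toy rows; column names follow the dataset, values are invented.
df = pd.DataFrame({
    "demographic.vital_status": ["alive", "dead", "alive", "alive"],
    "demographic.days_to_death": [np.nan, 812.0, np.nan, np.nan],
    "diagnoses.days_to_last_follow_up": [2234.0, np.nan, 150.0, np.nan],
    "demographic.race": ["white", "not reported", "white", "asian"],
})

# Assumption: these placeholder strings carry no information and count as missing.
PLACEHOLDERS = ["not reported", "unknown"]
cleaned = df.replace(PLACEHOLDERS, np.nan)

missing = pd.DataFrame({
    "n_missing": cleaned.isna().sum(),
    "frac_missing": cleaned.isna().mean(),
}).sort_values("frac_missing", ascending=False)

print(missing)
```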

Expected Outcomes

Through systematic exploration, we aimed to:

  • Identify the most informative features for survival prediction
  • Detect potential confounding variables or selection biases
  • Establish baseline expectations for model performance
  • Generate hypotheses about cancer survival mechanisms for validation in supervised learning

Breast (BRCA) Cancer Exploratory Data Analysis (EDA)

BRCA Dataset Overview:
Shape: (4920, 38)

Columns: ['project.project_id', 'cases.case_id', 'cases.disease_type', 'cases.index_date', 'cases.primary_site', 'cases.submitter_id', 'demographic.age_is_obfuscated', 'demographic.days_to_death', 'demographic.ethnicity', 'demographic.gender', 'demographic.race', 'demographic.submitter_id', 'demographic.vital_status', 'diagnoses.age_at_diagnosis', 'diagnoses.ajcc_pathologic_m', 'diagnoses.ajcc_pathologic_n', 'diagnoses.ajcc_pathologic_stage', 'diagnoses.ajcc_pathologic_t', 'diagnoses.classification_of_tumor', 'diagnoses.days_to_diagnosis', 'diagnoses.days_to_last_follow_up', 'diagnoses.diagnosis_is_primary_disease', 'diagnoses.laterality', 'diagnoses.method_of_diagnosis', 'diagnoses.morphology', 'diagnoses.primary_diagnosis', 'diagnoses.prior_malignancy', 'diagnoses.prior_treatment', 'diagnoses.site_of_resection_or_biopsy', 'diagnoses.sites_of_involvement', 'diagnoses.submitter_id', 'diagnoses.synchronous_malignancy', 'diagnoses.tissue_or_organ_of_origin', 'treatments.submitter_id', 'treatments.treatment_or_therapy', 'treatments.treatment_type', 'survival_time_days', 'diagnoses.behavior']
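The column list includes a derived `survival_time_days` field alongside `demographic.days_to_death` and `diagnoses.days_to_last_follow_up`. This section does not state how it was computed. One common convention, assumed here purely for illustration, is to use days to death when a death is recorded and days to last follow-up otherwise. The `event_observed` flag is a hypothetical helper column, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows; column names follow the dataset, values are invented.
df = pd.DataFrame({
    "demographic.vital_status": ["alive", "dead", "alive"],
    "demographic.days_to_death": [np.nan, 812.0, np.nan],
    "diagnoses.days_to_last_follow_up": [2234.0, 790.0, np.nan],
})

# Assumed rule: days_to_death when recorded, else fall back to last follow-up.
df["survival_time_days"] = df["demographic.days_to_death"].fillna(
    df["diagnoses.days_to_last_follow_up"]
)

# Hypothetical event flag: 1 = death observed, 0 = still alive at last contact.
df["event_observed"] = (df["demographic.vital_status"] == "dead").astype(int)

print(df[["survival_time_days", "event_observed"]])
```

Under this rule, a living patient's value is a follow-up duration rather than a time to death, which matters when interpreting averages by vital status.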

Univariate Analysis (Single Feature)

  • Frequency Counts: For categorical features (e.g., vital_status, ajcc_pathologic_stage, classification_of_tumor, treatment_type, laterality, diagnoses.behavior), visualize the frequency distribution (bar charts).
  • Age Distribution: Analyze the range and spread of age_at_diagnosis (histograms, box plots).
  • Survival Time Distribution: Examine the spread of survival_time_days (histograms, box plots).
  • Race distribution
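The univariate checks listed above reduce to a few pandas calls before any plotting. This is a minimal sketch on invented rows; the numbers are not taken from the BRCA data.

```python
import pandas as pd

# Toy rows; column names follow the dataset, values are invented.
df = pd.DataFrame({
    "demographic.vital_status": ["alive", "alive", "dead", "alive", "dead", "alive"],
    "diagnoses.age_at_diagnosis": [45.0, 61.0, 72.0, 38.0, 66.0, 54.0],
    "survival_time_days": [2100.0, 640.0, 1310.0, 95.0, 2890.0, 400.0],
})

# Categorical feature: counts and proportions (the input to a bar chart).
status_counts = df["demographic.vital_status"].value_counts()
status_share = df["demographic.vital_status"].value_counts(normalize=True)

# Numerical features: range and spread (what a histogram or box plot would show).
numeric_summary = df[["diagnoses.age_at_diagnosis", "survival_time_days"]].describe()

print(status_counts, status_share.round(2), numeric_summary, sep="\n\n")
```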

Univariate Analysis Findings - Categorical

  • The vital_status column is heavily imbalanced towards alive. This is plausible, since breast cancer has a high survival rate compared with many other cancers.
  • Tumor behavior is almost entirely malignant, and for most patients this is their first known malignant cancer diagnosis.
  • Over 70% of the patients are white, followed by almost 20% Black/African American, which broadly mirrors the US population.
  • As expected, most patients received treatment, most commonly chemotherapy, followed by surgery and hormone therapy.
  • Finally, the disease type is ductal and lobular neoplasms, and there is no difference in laterality (which breast is affected).

Univariate Analysis Findings - Numerical

  • The average age at diagnosis is 56 years, with a minimum of 26 and a maximum of 89.
  • The average survival time is 1,324 days from diagnosis, ranging from 0 to 8,605 days.

Bivariate Analysis (Two Features)

  • vital_status vs. survival_time_days: Explore how survival time varies by vital_status and compare average survival_time_days (box plot or scatter plot).
  • age_at_diagnosis vs. survival_time_days: Analyze differences in average survival time by age_at_diagnosis.
  • treatment_type vs. survival_time_days: Check how treatment_type and survival_time_days are related.
  • treatment_type vs. vital_status: Check which treatment types were offered, by vital_status.
  • ajcc_pathologic_stage vs. survival_time_days: Explore how survival time varies by ajcc_pathologic_stage.
  • survival_time_days vs. race: Analyze survival time across different races.
  • diagnoses.behavior vs. survival_time_days: Explore how survival time varies by diagnoses.behavior.
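The comparisons listed above fall into two patterns: a numeric target summarized by category, and one category tabulated against another. A hedged sketch of both on toy rows follows; the treatment labels and values are illustrative only.

```python
import pandas as pd

# Toy rows; column names follow the dataset, values are invented.
df = pd.DataFrame({
    "demographic.vital_status": ["alive", "dead", "alive", "dead", "alive", "alive"],
    "treatments.treatment_type": [
        "chemotherapy", "chemotherapy", "hormone therapy",
        "hormone therapy", "chemotherapy", "hormone therapy",
    ],
    "survival_time_days": [2100.0, 640.0, 1310.0, 2890.0, 95.0, 400.0],
})

# Numeric target by category: the summary a box plot would draw.
by_status = df.groupby("demographic.vital_status")["survival_time_days"].agg(
    ["count", "median", "mean"]
)

# Category by category: row-normalized treatment mix within each vital status.
treatment_mix = pd.crosstab(
    df["demographic.vital_status"],
    df["treatments.treatment_type"],
    normalize="index",
)

print(by_status, treatment_mix, sep="\n\n")
```

Reporting the group count next to the median or mean makes it easier to spot comparisons that rest on very few patients.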

Bivariate Analysis Findings

  • Survival time is slightly higher for deceased patients than for living patients. This is likely because many living patients were diagnosed recently and have not yet had time to accumulate survival days.
  • Older patients tend to have lower survival times than younger patients. This is expected, as older patients tend to have more comorbidities and a weaker immune system.
  • Patients who received immunotherapy tend to have higher survival times than patients on other treatment types, likely because immunotherapy is a more aggressive treatment option. Chemotherapy shows the next-highest survival times.
  • Patients who received treatment tend to have higher survival times than those who did not, which is expected because treatment is designed to improve patient outcomes.
  • Survival times are lower for patients with stage iiib, one of the more severe stages of breast cancer before metastasis (stage iv). Stages i and ib have the highest survival times, as they are the least severe stages.
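The first finding above is a symptom of right-censoring: for a living patient, survival_time_days is only a lower bound on true survival. The analysis here does not report using a censoring-aware method; purely as an illustration, the sketch below implements a small Kaplan-Meier estimator from scratch on invented data. The function name `kaplan_meier` and the 0/1 event coding are assumptions of this example.

```python
import pandas as pd


def kaplan_meier(durations, events):
    """Return S(t) at each distinct time where at least one death occurs.

    durations: follow-up time per patient (e.g. survival_time_days)
    events:    1 if death was observed, 0 if the patient was censored
    """
    data = pd.DataFrame({"t": durations, "e": events}).sort_values("t")
    n_at_risk = len(data)
    survival = 1.0
    curve = {}
    for t, group in data.groupby("t"):
        deaths = int(group["e"].sum())
        if deaths:
            # Patients censored at time t still count as at risk at t.
            survival *= 1 - deaths / n_at_risk
            curve[t] = survival
        # Everyone recorded at time t (dead or censored) leaves the risk set.
        n_at_risk -= len(group)
    return pd.Series(curve, name="survival_probability")


# Toy cohort: three observed deaths and three censored (still alive) patients.
durations = [100, 200, 200, 300, 400, 500]
events = [1, 0, 1, 0, 1, 0]

km = kaplan_meier(durations, events)
print(km)
```

Unlike a mean of survival_time_days split by vital status, this estimate uses living patients for exactly as long as they were followed, so recently diagnosed patients do not drag the result down.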

Multivariate Analysis (Multiple Features)

  • ajcc_pathologic_stage vs. age_at_diagnosis vs. survival_time_days: Analyze how survival time varies across different stages and age groups (3D scatter plot or heatmap).
  • treatment_type vs. age_at_diagnosis vs. survival_time_days: Explore trends across treatment types, age groups, and survival times (grouped bar plots).
  • laterality vs. age_at_diagnosis vs. survival_time_days: Check whether tumor laterality affects survival time in different age groups.
  • diagnoses.behavior vs. age_at_diagnosis vs. survival_time_days: Compare survival times across diagnoses.behavior and age groups.
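A stage-by-age heatmap like the one described above is typically drawn from a pivot table. The sketch below shows one way to build it; the age bands, the `age_group` column name, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows; column names follow the dataset, values are invented.
df = pd.DataFrame({
    "diagnoses.ajcc_pathologic_stage": [
        "stage i", "stage i", "stage iiib", "stage iiib", "stage i", "stage iiib",
    ],
    "diagnoses.age_at_diagnosis": [42.0, 75.0, 48.0, 81.0, 55.0, 66.0],
    "survival_time_days": [2400.0, 1500.0, 900.0, 300.0, 2000.0, 700.0],
})

# Assumed age bands; right=False makes each bin [low, high).
df["age_group"] = pd.cut(
    df["diagnoses.age_at_diagnosis"],
    bins=[0, 50, 70, 120],
    labels=["<50", "50-69", "70+"],
    right=False,
)

# Mean survival per (stage, age group) cell: the matrix a heatmap would display.
mean_survival = df.pivot_table(
    index="diagnoses.ajcc_pathologic_stage",
    columns="age_group",
    values="survival_time_days",
    aggfunc="mean",
    observed=False,
)

# Cell sizes, to flag means that rest on very few patients.
cell_counts = df.pivot_table(
    index="diagnoses.ajcc_pathologic_stage",
    columns="age_group",
    values="survival_time_days",
    aggfunc="count",
    observed=False,
)

print(mean_survival, cell_counts, sep="\n\n")
```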

Multivariate Analysis Findings

  • Generally, older patients tend to have lower survival times across all stages of breast cancer, with later stages (stage iiib, iiic, iv) showing more pronounced decreases in survival time.
  • Younger patients generally have higher survival times, especially when the diagnosis is made at an early stage (stage i, ib, ii).
  • Stage x indicates that the tumor could not be assessed; these patients tend to have higher survival times.
  • Overall, diagnosis at an earlier stage is associated with better survival outcomes, regardless of age.
  • For older patients (>70 years old), survival times do not differ between treated and untreated patients. For younger patients (<70 years old), those who received treatment tend to have higher survival times than those who did not.

Text Analysis

  • Sites of involvement: Word cloud or frequency distribution of common sites mentioned.

Text Analysis Findings

  • The text analysis reveals no difference between the left and right breast and a slightly higher frequency for the upper outer region.

Correlations and Associations

  • Correlation Matrix: Compute correlations between numerical features (e.g., age_at_diagnosis and survival_time_days) to find relationships.
Correlation between Age at Diagnosis and Survival Time: -0.1989

There is a weak negative correlation (r ≈ -0.20) between age at diagnosis and survival time: as age at diagnosis increases, survival time tends to decrease. This suggests that older patients may have poorer survival outcomes than younger patients.
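For reference, a correlation like the one reported above can be computed as sketched below. The rows are invented, so the resulting values are unrelated to the reported -0.1989. The rank-based (Spearman) variant is an addition of this example, included because survival times are usually right-skewed.

```python
import pandas as pd

# Toy rows; column names follow the dataset, values are invented.
df = pd.DataFrame({
    "diagnoses.age_at_diagnosis": [38.0, 45.0, 54.0, 61.0, 66.0, 72.0],
    "survival_time_days": [2890.0, 2100.0, 400.0, 1310.0, 640.0, 95.0],
})

age = df["diagnoses.age_at_diagnosis"]
survival = df["survival_time_days"]

# Pearson correlation measures linear association.
pearson = age.corr(survival)

# Spearman correlation is the Pearson correlation of the ranks.
spearman = age.rank().corr(survival.rank())

print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```

A coefficient near -0.2 corresponds to an r² of about 0.04, meaning a straight-line fit on age would account for only about 4% of the variance in survival time.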

cesc_df.head(1), shown transposed (column: value)

project.project_id: tcga-cesc
cases.case_id: 00bca18c-b3d4-45a3-8f19-034cc40449a4
cases.disease_type: squamous cell neoplasms
cases.index_date: diagnosis
cases.lost_to_followup: yes
cases.primary_site: cervix uteri
cases.submitter_id: tcga-c5-a2lv
demographic.days_to_death: NaN
demographic.ethnicity: not hispanic or latino
demographic.gender: female
demographic.race: black or african american
demographic.submitter_id: tcga-c5-a2lv_demographic
demographic.vital_status: alive
diagnoses.age_at_diagnosis: 36.0
diagnoses.ajcc_pathologic_m: mx
diagnoses.ajcc_pathologic_n: n1
diagnoses.ajcc_pathologic_t: t1b
diagnoses.classification_of_tumor: primary
diagnoses.days_to_diagnosis: 0
diagnoses.days_to_last_follow_up: 2234.0
diagnoses.figo_stage: stage ib
diagnoses.figo_staging_edition_year: 1995
diagnoses.method_of_diagnosis: biopsy
diagnoses.morphology: 8070/3
diagnoses.primary_diagnosis: squamous cell carcinoma, nos
diagnoses.prior_malignancy: False
diagnoses.prior_treatment: False
diagnoses.site_of_resection_or_biopsy: cervix uteri
diagnoses.submitter_id: tcga-c5-a2lv_diagnosis
diagnoses.synchronous_malignancy: False
diagnoses.tissue_or_organ_of_origin: cervix uteri
diagnoses.tumor_grade: g3
treatments.submitter_id: tcga-c5-a2lv_treatment3
treatments.treatment_or_therapy: yes
treatments.treatment_type: hysterectomy, nos
survival_time_days: 2234.0
diagnoses.behavior: malignant
exposures.tobacco_smoking_status: current smoker

Cervical Cancer (CESC) Dataset Exploratory Data Analysis (EDA)

CESC Dataset Overview:
Shape: (872, 38)

Columns: ['project.project_id', 'cases.case_id', 'cases.disease_type', 'cases.index_date', 'cases.lost_to_followup', 'cases.primary_site', 'cases.submitter_id', 'demographic.days_to_death', 'demographic.ethnicity', 'demographic.gender', 'demographic.race', 'demographic.submitter_id', 'demographic.vital_status', 'diagnoses.age_at_diagnosis', 'diagnoses.ajcc_pathologic_m', 'diagnoses.ajcc_pathologic_n', 'diagnoses.ajcc_pathologic_t', 'diagnoses.classification_of_tumor', 'diagnoses.days_to_diagnosis', 'diagnoses.days_to_last_follow_up', 'diagnoses.figo_stage', 'diagnoses.figo_staging_edition_year', 'diagnoses.method_of_diagnosis', 'diagnoses.morphology', 'diagnoses.primary_diagnosis', 'diagnoses.prior_malignancy', 'diagnoses.prior_treatment', 'diagnoses.site_of_resection_or_biopsy', 'diagnoses.submitter_id', 'diagnoses.synchronous_malignancy', 'diagnoses.tissue_or_organ_of_origin', 'diagnoses.tumor_grade', 'treatments.submitter_id', 'treatments.treatment_or_therapy', 'treatments.treatment_type', 'survival_time_days', 'diagnoses.behavior', 'exposures.tobacco_smoking_status']

Univariate Analysis (Single Feature)

  • Frequency Counts: For categorical features (e.g., vital_status, figo_stage, tumor_grade, treatment_type, diagnoses.behavior), visualize the frequency distribution (bar charts).
  • Age Distribution: Analyze the range and spread of age_at_diagnosis (histograms, box plots).
  • Survival Time Distribution: Examine the spread of survival_time_days (histograms, box plots).
  • Race distribution

Univariate Analysis Findings - Categorical

  • The vital_status column is heavily imbalanced towards alive. This is plausible, since cervical cancer also has a high survival rate compared with many other cancers.
  • Tumor grade is dominated by grade ii and grade iii tumors, with very few patients having grade i tumors.
  • Stage is dominated by ib1, followed by iib and ib2, which are among the less severe stages of cervical cancer.
  • Almost 70% of the patients are white, with the remaining races approximately equally represented; this is not reflective of the US population.
  • As expected, most patients received treatment, most commonly pharmaceutical therapy, followed by radiation therapy.
  • Finally, the disease type is squamous cell neoplasms. All patients have malignant tumors and no prior malignant cancer diagnosis.

Univariate Analysis Findings - Numerical

  • The average age at diagnosis is 48 years, with a minimum of 25 and a maximum of 80.
  • The average survival time is 1,036 days from diagnosis, ranging from 2 to 6,408 days.

Bivariate Analysis (Two Features)

  • vital_status vs. survival_time_days: Explore how survival time varies by vital_status and compare average survival_time_days (box plot or scatter plot).
  • age_at_diagnosis vs. survival_time_days: Analyze differences in average survival time by age_at_diagnosis.
  • treatment_type vs. survival_time_days: Check how treatment_type and survival_time_days are related.
  • treatment_type vs. vital_status: Check which treatment types were offered, by vital_status.
  • figo_stage vs. survival_time_days: Explore how survival time varies by figo_stage.
  • survival_time_days vs. race: Analyze survival time across different races.
  • tumor_grade vs. survival_time_days: Explore how survival time varies by tumor_grade.
  • diagnoses.behavior vs. survival_time_days: Explore how survival time varies by diagnoses.behavior.

Bivariate Analysis Findings

  • Survival time is slightly higher for living patients, which is expected because living patients have had more time to accumulate survival days. This differs from breast cancer, where many living patients were recently diagnosed.
  • Older patients tend to have higher survival times than younger patients. This is unexpected, as older patients tend to have more comorbidities and a weaker immune system.
  • Patients who received radiation combination therapy tend to have higher survival times than patients on other treatment types, likely because combination therapy is a more aggressive treatment option. Pharmaceutical therapy shows the next-highest survival times.
  • Survival times do not differ between patients who received treatment and those who did not, which is unexpected because treatment is designed to improve patient outcomes.
  • Survival times are lower for patients with stage iiib, one of the more severe stages of cervical cancer before metastasis (stage iv). Stage ib has the highest survival times, as it is one of the least severe stages.
  • Stages ia1 and ia2 have small sample sizes, which might indicate a need for earlier testing to catch cancer at these stages. There is no data for stage iii, which might also suggest a need to shift diagnosis to earlier stages.

Multivariate Analysis (Multiple Features)

  • Tumor_stage vs. age_at_diagnosis vs. survival_time_days: Analyze how survival time varies across different stages and age groups (3D scatter plot or heatmap).
  • treatment_type vs. age_at_diagnosis vs. survival_time_days: Explore trends across treatment types, age groups, and survival times (grouped bar plots).
  • figo_stage vs. age_at_diagnosis vs. survival_time_days: Check whether stage affects survival time in different age groups.
  • diagnoses.behavior vs. age_at_diagnosis vs. survival_time_days: Compare survival times across diagnoses.behavior and age groups.

Multivariate Analysis Findings

  • Generally, earlier stages show higher survival times across all age groups.
  • Interpretation of the heatmap is obscured by missing data at some stages. In addition, younger patients show lower survival times because they have had less time to accumulate survival days.

Correlations and Associations

  • Correlation Matrix: Compute correlations between numerical features (e.g., age_at_diagnosis and survival_time_days) to find relationships.
Correlation between Age at Diagnosis and Survival Time: 0.1390

There is a weak positive correlation (r ≈ 0.14) between age at diagnosis and survival time: as age at diagnosis increases, survival time tends to increase. Taken at face value, this suggests that older patients have better survival outcomes than younger patients, which is unexpected; it may instead reflect fewer survival days accumulated by younger patients.

Smoking Exposure Effect on survival time for CESC patients

  • smoking_exposure vs. survival_time_days: Analyze how smoking exposure impacts survival time (box plots or histograms).
Smoking Status Distribution in CESC Dataset:
exposures.tobacco_smoking_status
lifelong non-smoker                                553
current smoker                                     142
current reformed smoker for < or = 15 yrs           94
not reported                                        29
unknown                                             23
current reformed smoker for > 15 yrs                20
current reformed smoker, duration not specified     11
Name: count, dtype: int64

Total patients with smoking data: 872


============================================================
SMOKING EXPOSURE SURVIVAL ANALYSIS SUMMARY
============================================================

Insufficient data for statistical comparison between never smokers and current smokers

============================================================

There is a heavy imbalance towards lifelong non-smoker in the tobacco smoking status column, which might indicate underreporting or misclassification of smoking status among cervical cancer patients. This result suggests that smoking status may not be a reliable indicator of cervical cancer risk in this dataset. In the plot of survival time by smoking status, lifelong non-smokers and patients with unknown status tend to have higher survival times than current and former smokers. However, there is insufficient data to draw definitive conclusions about the impact of smoking status on survival time in cervical cancer patients.
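One way to make the smoking comparison more robust is to collapse the reported labels into coarser groups and check group sizes before comparing them. The sketch below is a hedged illustration: the grouping in `SMOKING_GROUPS`, the `MIN_GROUP_SIZE` threshold, and the toy survival values are assumptions, and only the status labels themselves come from the distribution shown above.

```python
import pandas as pd

# Assumed collapse of the reported labels into coarser groups.
SMOKING_GROUPS = {
    "lifelong non-smoker": "never",
    "current smoker": "current",
    "current reformed smoker for < or = 15 yrs": "former",
    "current reformed smoker for > 15 yrs": "former",
    "current reformed smoker, duration not specified": "former",
    "not reported": "unknown",
    "unknown": "unknown",
}

# Toy rows; survival values are invented.
df = pd.DataFrame({
    "exposures.tobacco_smoking_status": [
        "lifelong non-smoker", "lifelong non-smoker", "lifelong non-smoker",
        "current smoker", "current smoker", "current smoker",
        "current reformed smoker for > 15 yrs", "not reported",
    ],
    "survival_time_days": [1200.0, 900.0, 2000.0, 400.0, 800.0, 600.0, 1000.0, 1500.0],
})

df["smoking_group"] = df["exposures.tobacco_smoking_status"].map(SMOKING_GROUPS)

summary = df.groupby("smoking_group")["survival_time_days"].agg(["count", "median"])

# Only compare groups with enough patients; the threshold is a toy value.
MIN_GROUP_SIZE = 3
comparable = summary.index[summary["count"] >= MIN_GROUP_SIZE].tolist()

print(summary)
print("Groups large enough to compare:", comparable)
```

Any label missing from the mapping becomes NaN in `smoking_group` and drops out of the groupby, so it is worth checking for unmapped values before interpreting the summary.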

Summary of Findings

The exploratory analysis of both breast cancer (BRCA) and cervical cancer (CESC) datasets reveals important insights into patient demographics, tumor characteristics, treatments, and survival outcomes, highlighting both similarities and disease-specific differences. In both cancers, the vital status column is heavily skewed towards alive, reflecting the generally high survival rates for these cancers. However, survival dynamics differ when stratified by age, stage, and treatment type.

For breast cancer, patients’ average age at diagnosis is 56 years, with survival time averaging 1,324 days. Survival decreases with age, likely due to comorbidities and decreased physiological resilience. Stage-specific analysis shows that early-stage tumors (Stage I and IB) have the highest survival, while more advanced stages (Stage IIIB, IIIC, IV) exhibit lower survival. Interestingly, Stage X, indicating unassessable tumors, is associated with relatively higher survival, possibly reflecting early detection or missing stage data for indolent tumors. Treatment plays a clear role in outcomes; patients receiving immunotherapy or chemotherapy exhibit higher survival, particularly among those under 70 years, whereas older patients (>70) show minimal differences with or without treatment. Tumor type is largely ductal or lobular, and no laterality effect is observed, consistent with previous literature on breast cancer prognosis [1].

Cervical cancer presents a younger cohort, with a mean age at diagnosis of 48 years and average survival of 1,036 days. The majority of tumors are squamous cell carcinomas, predominantly grade II or III, with stages dominated by IB1, IB2, and IIB. Unlike breast cancer, survival shows a slightly positive correlation with age, an unexpected pattern likely influenced by fewer accumulated survival days for younger patients and missing data. While treatment generally includes pharmaceutical therapy and radiation, survival differences between treated and untreated patients are negligible, suggesting either dataset limitations or variability in treatment efficacy. Notably, stage IIIB is associated with lower survival, reinforcing the established correlation between advanced stage and poorer outcomes [2].

Comparing both cancers, early-stage detection consistently correlates with better survival outcomes, underlining the importance of screening programs. While breast cancer survival is more sensitive to age and treatment, cervical cancer survival appears more influenced by tumor grade and stage, with age playing an unexpected role due to dataset-specific factors. Racial composition differs, with breast cancer patients predominantly white and African American, reflecting the U.S. population, whereas cervical cancer data shows less representativeness, potentially limiting generalizability.

Exposure analysis, particularly tobacco smoking status in cervical cancer, reveals a strong bias towards lifelong non-smokers, likely due to underreporting. This prevents definitive conclusions regarding the role of smoking in survival outcomes, although current and former smokers tend to show shorter survival times. In breast cancer, while similar exposure data were not highlighted, established literature suggests that lifestyle and hormonal factors play a role in risk and outcomes [3].

In summary, both breast and cervical cancer datasets emphasize the critical role of early detection and stage at diagnosis in determining survival. Treatment type is a more decisive factor for breast cancer outcomes than cervical cancer, and age interacts differently with survival in the two diseases. Exposure variables such as smoking require careful interpretation due to potential biases and missing data. These findings reinforce the need for comprehensive data collection and stratified analyses to improve prognostic modeling and targeted interventions across cancer types.

References:

  1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer Statistics, 2023. CA Cancer J Clin. 2023;73(1):17–48.
  2. Arbyn M, Weiderpass E, Bruni L, et al. Estimates of incidence and mortality of cervical cancer in 2018: a worldwide analysis. Lancet Glob Health. 2020;8(2):e191–203.
  3. Collaborative Group on Hormonal Factors in Breast Cancer. Breast cancer and hormonal contraceptives: collaborative reanalysis of individual data on 53,297 women with breast cancer and 100,239 women without breast cancer. Lancet. 2002;360:1040–1054.

Code

Code files for EDA can be found here
